
13 research outputs found

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Author: Ayad Lorraine A. K.
Loukides Grigorios
Pissis Solon P.
Verbeek Hilde
Publication date: 13/10/2023

Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency. Comment: 16 pages, 1 figure
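To make the output of sparse suffix sorting concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it sorts the chosen suffixes by direct comparison, which can take quadratic time, and the function name `sparse_suffix_and_lcp` is a hypothetical helper chosen for illustration only.

```python
def sparse_suffix_and_lcp(text, positions):
    """Naive baseline (an assumption, not the paper's method):
    sort the b chosen suffixes directly, then compare neighbours."""
    # Sparse suffix array: the chosen start positions, ordered by their suffix.
    ssa = sorted(positions, key=lambda i: text[i:])

    def lcp(i, j):
        # Length of the longest common prefix of text[i:] and text[j:].
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    # Sparse LCP array: entry 0 is 0 by convention; entry r compares ranks r-1 and r.
    slcp = [0] + [lcp(a, b) for a, b in zip(ssa, ssa[1:])]
    return ssa, slcp


# Suffixes of "banana" starting at 1, 3, 5 are "anana", "ana", "a".
ssa, slcp = sparse_suffix_and_lcp("banana", [1, 3, 5])
print(ssa, slcp)  # [5, 3, 1] [0, 1, 3]
```

A baseline like this is mainly useful for checking the output of a faster implementation on small inputs.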

arXiv.org e-Print Archive

Constructing Antidictionaries of Long Texts in Output-Sensitive Space

Author: Ayad Lorraine A. K.
Badkobeh Golnaz
Fici Gabriele
Heliou Alice
Pissis Solon P.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/12/2020

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, …, y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y_1,…,y_k} of minimal absent words of length at most ℓ of the collection {y_1, …, y_k}. The set M^ℓ_{y_1,…,y_k} contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of y_i and the maximal proper prefix of x is a factor of y_j. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set M^ℓ_y of minimal absent words of a word y is equal to M^ℓ_{y_1,…,y_k} for any decomposition of y into a collection of words y_1, …, y_k such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ∥M^ℓ_{y_1,…,y_N}∥ = o(n), for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 10^9 but ∥M^12_{y_1,…,y_k}∥ ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ_{y_1}, …, M^ℓ_{y_1,…,y_k} can be computed in O(kn + ∑_{N=1}^{k} ∥M^ℓ_{y_1,…,y_N}∥) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y_1, …, y_k} and MaxOut = max{∥M^ℓ_{y_1,…,y_N}∥ : N ∈ [1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
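The definition of minimal absent words of a collection can be illustrated with a small brute-force sketch in Python. This is only an illustration under stated assumptions: it enumerates every candidate word over the alphabet, so it is exponential in the length bound and is not the output-sensitive algorithm of the paper. The names `factors_up_to` and `minimal_absent_words` are hypothetical.

```python
from itertools import product


def factors_up_to(word, max_len):
    # All factors (substrings) of `word` of length 0..max_len.
    return {word[i:i + m]
            for m in range(max_len + 1)
            for i in range(len(word) - m + 1)}


def minimal_absent_words(words, alphabet, max_len):
    """Brute force: x is reported if it occurs in no word of the collection,
    while x minus its first letter and x minus its last letter both occur
    in some word of the collection."""
    known = {""}
    for w in words:
        known |= factors_up_to(w, max_len)
    result = set()
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            x = "".join(letters)
            if x not in known and x[1:] in known and x[:-1] in known:
                result.add(x)
    return result


whole = minimal_absent_words(["abaab"], "ab", 3)
# Decomposition of "abaab" with an overlap of length 2 = 3 - 1.
pieces = minimal_absent_words(["abaa", "aab"], "ab", 3)
print(sorted(whole), whole == pieces)
```

On this small input the decomposed collection yields the same set as the whole word, which is consistent with the overlap property stated in the abstract.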

Goldsmiths Research Online

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Brunel University Research Archive

Archivio istituzionale della ricerca - Università di Palermo

Constructing Antidictionaries in Output-Sensitive Space

Author: Ayad Lorraine A. K.
Badkobeh Golnaz
Fici Gabriele
Heliou Alice
Pissis Solon P.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, y_2, ..., y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y_1#...#y_k} of minimal absent words of length at most ℓ of word y = y_1#y_2#...#y_k, # ∉ Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y_1#...#y_N}|| = o(n), for all N ∈ [1,k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12_{y_1#...#y_k}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ_{y_1}, ..., M^ℓ_{y_1#...#y_k} can be computed in O(kn + ∑_{N=1}^{k} ||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y_1, ..., y_k} and MaxOut = max{||M^ℓ_{y_1#...#y_N}|| : N ∈ [1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

arXiv.org e-Print Archive

Goldsmiths Research Online

Crossref

CWI's Institutional Repository

King's Research Portal

Archivio istituzionale della ricerca - Università di Palermo

International Lower Limb Collaborative (INTELLECT) study : a multicentre, international retrospective audit of lower extremity open fractures

Author: Abood A.
Abugarja A.
Albendary Mohamed
Alcaraz J.G.
Ali S.
Andrés-Peiró J.V.
Arnez Z.
Awadelkarim M.
Ayad W.
Ballini L.
Bashir A.
Beijk I.
Bekkers W.J.J.
Berner Juan Enrique
Bhat W.
Bidolegui F.
Biosse-Duplan G.
Botman M.
Bürger H.
Cabañas A.S.
Canahuate S.
Capitani D.
Capitani P.
Castillón P.
Cazzato V.
Cerbone V.
Chan James K.K.
Cherubino M.
Chouhy E.
Chuo Cher Bing
Cifuentes G.V.
Columbrans A.O.
Cooper Kerri
Crick A.
Cuadra A.
Curran T.
Cámara-Cabrera J.
Dafydd H.
Dams R.
de Groot R.
de Jong T.
de la Cruz A.T.
Dearden A.
Demandes H.
Digney C.
Domínguez R.M.
Doussoux P.C.
Eardley W.
Egglestone A.
Elamin S.E.
Elbahari H.
Elbatawy A.
Felipe Peña M.
Fernández Garrido Manuel
Fernández-Poch N.
Ferris S.
Flaherty F.
Garcia-Coiradas J.
Garcia-Sanchez Y.
García C.
García J.M.P.
Gardiner Matthew D.
Garutti L.
Giannakopoulos G.
Giblin V.
Giraldez M.A.
Gohil K.
Gómez M.V.
Hagiga Ahmed
Haley J.
Hamid H.K.S.
Harry Lorraine
Hassan S.
Hausner T.
Ho B.
Holm S.
Hong D.W.
Hong J.P.
Hsu H.
Hughes J.
Ibarra A.
Ibrahim Saidu
Itte V.
Jacobs J.E.D.
Jain Abhilash
Jang M.
Jaureguialzo M.
Jonsson E.L.
Katsura Chie
Kennedy A.
Kilshaw A.
Koide S.
Kooi K.
Kopp L.
Kunc V.
Kuo R.
Kwon J.G.
Lafford G.
Lancerotto L.
Lapolla P.
Layseca Alvaro
Lee C.W.
Lim K.
Louette S.
Lucio A.E.
Lutgendorff F.
López Ortega A.
López B.O.
Macán F.
Mangelsdorff Günther
Marco F.
Marruzzo G.
Martín D.G.
Martínez A.E.
Martínez J.F.P.
Martínez-Carlón M.M.
Masià Jaume
Materazzi G.
McArdle C.
Mengod J.B.
Mingoli A.
Mitchell C.
Mohan Arvind
Mohan M.
Monasterio M.F.
Moral-Nestares R.
Moreno D.
Muñoz D.N.
Nanchahal Jagdeep
Navia Alfonso
Nizamoglu M.
Nolan Grant
Norton S.
Nova M.N.
O'Mahoney L.B.
Oflazoglu K.
Olivella G.M.
Ondoño Navarro A.
Ortega-Briones Alina
Ortiz-Llorens Manuel
Ouf M.
Palma J.
Pascual J.E.
Paulus V.A.A.
Peberdy D.
Pereira Nicolás
Pereyra S.
Plascencia A.R.A.
Poelstra R.
Porcel-Vazquez J.A.
Quadlbauer S.
Quiroga Bilbao M.A.
Qureshi Arham
Rabey N.
Raiola F.
Rajasekaran S.
Rakhorst Hinne A.
Ramírez L.E.E.
Rawlins J.
Reichetseder J.
Requena F.
Ribuffo D.
Rissios J.P.Henríquez
Robinson A.
Rodríguez Astudillo J.R.
Rodríguez A.
Romijn P.
Ros J.M.
Sabapathy S.R.
Samarendra Harsh
Sandhu S.
Santamaria E.
Sañudo B.M.
Seidl E.
Selga-Marsà J.
Shah K.A.
Skillman Joanna
Slade R.
Smith F.
Sperone E.
Standen M.
Surroca M.
Taher S.
Talamonti T.
Tarassoli S.
Teixidor-Serra J.
Tejero D.A.
Tejos Rodrigo
ten Cate W.
Thompson J.
Tinhofer I.E.
To K.
Tomas-Hernandez J.
Toro S.V.
Torres C.A.Z.
Troisi L.
Tromp T.N.
Tzou C.J.
van der Zwaal P.
van Egmond P.W.
van Miltenburg S.
Venegas Josefa
Venkatramani H.
Verra W.
Vizcay M.
Wallis Katy
Wearn C.
Wei N.
West C.
Wolff O.
Wood B.
Wyman M.
Yerson D.
Zamora Paúl David
Publication date: 01/01/2022

Diposit Digital de Documents de la UAB

MARS: improving multiple circular sequence alignment using refined sequences

Author: Ayad Lorraine A. K.
Pissis Solon P.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/01/2017

Background: A fundamental assumption of all widely-used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary due to a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure; or inconsistencies introduced into sequence databases due to different linearisation standards. These scenarios are relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure. A solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; these refined sequences may in turn lead to improved multiple sequence alignments using the preferred multiple sequence alignment program.
Results: We present MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS was implemented in the C++ programming language as a program to compute the rotations (cyclic shifts) required to best align a set of input sequences. Experimental results, using real and synthetic data, show that MARS improves the alignments, with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and outperforms state-of-the-art methods both in terms of accuracy and efficiency. Our results show, among others, that the average pairwise distance in the multiple sequence alignment of a dataset of widely-studied mitochondrial DNA sequences is reduced by around 5% when MARS is applied before a multiple sequence alignment is performed.
Conclusions: Analysing multiple sequences simultaneously is fundamental in biological research and multiple sequence alignment has been found to be a popular method for this task. Conventional alignment techniques cannot be used effectively when the position where sequences start is arbitrary. We present here a method, which can be used in conjunction with any multiple sequence alignment program, to address this problem effectively and efficiently.
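The idea of choosing a rotation (cyclic shift) before alignment can be shown with a toy sketch in Python. It is not the MARS heuristic, which is implemented in C++ and handles sets of sequences: this sketch assumes two sequences of equal length and simply tries every rotation of the second one, scoring by Hamming distance. The function name `best_rotation` is hypothetical.

```python
def best_rotation(reference, query):
    """Toy illustration (not MARS): try every cyclic shift of `query` and
    keep the one with the fewest mismatches against `reference`.
    Assumes both sequences have the same length."""
    if len(reference) != len(query):
        raise ValueError("this sketch only handles equal-length sequences")
    best_shift, best_dist = 0, len(query) + 1
    for shift in range(len(query)):
        rotated = query[shift:] + query[:shift]
        dist = sum(a != b for a, b in zip(reference, rotated))
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    rotated = query[best_shift:] + query[:best_shift]
    return best_shift, rotated


# "TACACG" is the circular sequence "ACGTAC" linearised at a different position.
shift, refined = best_rotation("ACGTAC", "TACACG")
print(shift, refined)  # 3 ACGTAC
```

The refined sequence, rather than the original linearisation, would then be passed to the alignment program of choice.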

PubMed Central

King's Research Portal

Springer OAI

Seedability: optimizing alignment parameters for sensitive sequence comparison

Author: Ayad Lorraine A. K.
Chikhi Rayan
Pissis Solon P.
Publication venue: Oxford academic
Publication date: 01/01/2023

Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
Availability and implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0)
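The sensitivity issue described above can be seen in a short Python sketch that counts the exact k-mers shared by two sequences for several values of k. This is only an illustration of why the choice of k matters; it does not reproduce how Seedability estimates its parameters, and the function name `shared_kmers` is hypothetical.

```python
def shared_kmers(s, t, k):
    """Set of exact k-mers (seeds) occurring in both sequences."""
    kmers_s = {s[i:i + k] for i in range(len(s) - k + 1)}
    kmers_t = {t[i:i + k] for i in range(len(t) - k + 1)}
    return kmers_s & kmers_t


# Two short sequences that differ by a single substitution.
s, t = "ACGTACGT", "ACGAACGT"
counts = {k: len(shared_kmers(s, t, k)) for k in (3, 4, 5)}
print(counts)  # {3: 2, 4: 1, 5: 0}
```

On this pair, k = 5 leaves no shared seed at all, so a seed-based aligner with that setting would have no anchor to start from, whereas a smaller k still finds one.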

INRIA a CCSD electronic archive server

HAL-Pasteur

Degenerate String Comparison and Applications

Author: Alzamel Mai
Ayad Lorraine A. K.
Bernardini Giulia
Grossi Roberto
Iliopoulos Costas S.
Pisanti Nadia
Pissis Solon P.
Rosone Giovanna
Publication date: 01/01/2018

A generalised degenerate string (GD string) S is a sequence of n sets of strings of total size N, where the i-th set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1,...,k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W,n^2}N)-time algorithm for computing all palindromes in S. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in S. Finally, proof-of-concept experimental results are presented using real protein datasets
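As a concrete reading of the GD string definition, the Python sketch below represents a GD string as a list of sets of equal-length strings and tests intersection by expanding both languages. This brute force is exponential in the number of sets and is not the O(N+M)-time algorithm of the paper; the names `language` and `intersects` are hypothetical.

```python
from itertools import product


def language(gd):
    """All strings spelled by picking one string from each set, in order."""
    return {"".join(choice) for choice in product(*gd)}


def intersects(gd1, gd2):
    # Brute force (an assumption for illustration): expand both languages.
    return bool(language(gd1) & language(gd2))


# Two GD strings of width W = 3 with different set boundaries.
gd_a = [{"AB", "CD"}, {"E"}]        # spells ABE, CDE
gd_b = [{"A", "C"}, {"BE", "DF"}]   # spells ABE, ADF, CBE, CDF
gd_c = [{"A"}, {"XY"}]              # spells AXY
print(intersects(gd_a, gd_b), intersects(gd_a, gd_c))  # True False
```

The example also shows that the two GD strings may split the same width into sets of different lengths, which is part of what makes the comparison problem non-trivial.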

Archivio istituzionale della ricerca - Università di Trieste

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

King's Research Portal